Abstract

In this paper we derive a new algorithm for Internet searching. The main idea of
this algorithm is to extend the existing algorithms by a component which reflects
the interests of the users more than existing methods do. The "Vox Populi
Algorithm" (VPA) [1] creates a feedback from the users to the content of the
search index. The information derived from the analysis of user queries is used to
modify the existing crawling algorithms. The VPA controls the distribution of the
resources of the crawler. Finally, we also discuss methods of suppressing unwanted
content (spam). This is necessary in order to enable an efficient performance of
the VPA.







The retrieval of relevant information from data sources with a very complex
structure has become a challenging task since the number of documents on the
Internet has reached several billion. Only a small part of them is visible in
search engines. The problem of organizing and structuring these data into
catalogues or searchable databases is of theoretical and of significant practical
(commercial) interest.

Let us define the basic components for the mathematical description of the interests
of the users, the relevancy of the search results and the crawling process. The users
of search engines express their needs for information through the queries which they
address to a searchable database (index) $I$. Each of the $k$ queries consists of one
or more keywords $q$ addressed to this index. It will be presented as:

$$q_k = (q_1, \ldots, q_n)_k \qquad (1)$$

$n$ is the length of the query $k$. The number of keywords per average query is
$n \approx 2$ (status in 2003). The users are searching for documents $d_j$ (HTML
pages, tables, text processing documents, pictures, multimedia files, ...)
containing information. These documents are grouped (organized) in domains $D_k$,
which are sets of documents under a common editorial responsibility and address
(URL):

$$D_k = \bigcup_{j=1}^{n_k} d_j^{(k)}, \qquad n_k = \text{number of documents in } D_k \qquad (2)$$

The number of domains is about 6.4 million in Germany [2], and the number of
documents per domain, $n_k$, ranges from $10^{0}$ upward.



Each document $d$ contains searchable information, today limited to text
information. Content which is hidden from today's search technology in
non-indexable formats (bitmaps, scripts etc.) will be neglected here and in the
following. A document is characterized by its content of keywords $q$ and the
position of the keywords in certain format elements $e$ (metatags, headers, tables,
link text etc.):

$$d^{(k)} = f(q_1, q_2, \ldots, e_1, e_2, \ldots) \qquad (3)$$

During the crawling and indexing process, the image of the document $d$ in the
searchable index $I$ contains a reduced set of information: the keywords and their
position in the format elements $e$ of the document. When a query is addressed to
the index $I$, a ranking algorithm generates a set of documents (links) which is
ordered by the relevancy of the found documents. In order to describe the document
ranking process which generates the set of results for each query, one has to
introduce the density $\rho$ of keywords within the documents:

$$\rho_k(q_i) = \frac{n_{q_i}}{n_{e_k}} \qquad (4)$$

where $n_{q_i}$ is the number of occurrences of the keyword $q_i$ in the format
element $e_k$ and $n_{e_k}$ is the total number of words in this format element.



Today there exist two basic types of ranking algorithms: the dynamic and the
static ranking algorithms. The dynamic rank of a document depends on two factors
only: the keywords $q$ of the query and the information content of the documents.
Expressed as a rule of thumb: the higher the keyword density in the document, the
higher the dynamic rank of this document. The relevancy function $R_d$, defining
the dynamic rank of a document, can be written as:

$$R_d(q_i) \propto \sum_{k=1}^{N} \mu_k \, \rho_k(q_i), \qquad N = \text{number of format elements } e \qquad (5)$$

for a single keyword query. The coefficients $\mu_k$ are free parameters, defining
the importance or weight of each format element. For example, the occurrence of a
keyword in a URL is usually much more important than in the text itself:
$\mu_{URL} > \mu_{text}$. Queries with multiple keywords can be written as
superpositions of single keyword queries:

$$R_d(q_1, q_2, \ldots, q_n) = R_d(q_1) \cdot R_d(q_2) \cdot \ldots \cdot R_d(q_n) \qquad (6)$$

Usually these functions are modified for different purposes, such as the
suppression of unwanted information (spam). Other modifications can take into
account the freshness of the document, the type of the format or other technical
parameters.
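
The definitions (4) to (6) can be illustrated with a minimal sketch. It assumes
that a document is given as a mapping from format element to a list of words; the
weights in `WEIGHTS` and the sample document are hypothetical values chosen only
for illustration, not parameters of any real search engine.

```python
from math import prod

# Hypothetical weights mu_k for each format element (free parameters in (5)).
WEIGHTS = {"url": 3.0, "title": 2.0, "text": 1.0}


def keyword_density(keyword, words):
    """Density as in (4): occurrences of the keyword / total words in the element."""
    if not words:
        return 0.0
    return words.count(keyword) / len(words)


def dynamic_rank(keyword, document):
    """Single-keyword rank as in (5): weighted sum of densities over format elements."""
    return sum(
        WEIGHTS[element] * keyword_density(keyword, words)
        for element, words in document.items()
    )


def dynamic_rank_query(keywords, document):
    """Multi-keyword rank as in (6): product of the single-keyword ranks."""
    return prod(dynamic_rank(k, document) for k in keywords)


# Hypothetical document, already split into words per format element.
doc = {
    "url": ["mp3", "archive"],
    "title": ["free", "mp3", "download"],
    "text": ["download", "free", "mp3", "files", "here", "today"],
}

print(dynamic_rank("mp3", doc))
print(dynamic_rank_query(["mp3", "free"], doc))
```

Because (6) is a product, a document that lacks one of the keywords receives a
rank of zero for the whole query in this sketch.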

Practical work on search engines has shown that using only a document-related,
dynamic ranking algorithm is insufficient. In order to also include the importance
or the popularity of a domain (popularity among the webmasters, not necessarily
among Internet users), a new type of algorithm was invented: the static ranking
[3]. The static rank $R_s$ of a document $d_i$ is related to the importance of the
corresponding domain in which it is located. The idea of the static rank of a
domain $D$ can be expressed symbolically in the following form:

$$R_s(D) \propto \sum_{i=1}^{N_l} R_i \qquad (7)$$

where $R_i$ is the static rank of the sites linking to the domain $D$ and $N_l$ is
the total number of external links to the domain. In [4] a more detailed definition
of the page rank formula is given:

$$R_s(D) = (1 - d) + d \sum_{i=1}^{N_l} \frac{R_i}{M_i} \qquad (8)$$

where $d$ is a free parameter (usually in the region $d \approx 0.85$ [4]) and
$M_i$ is the total number of outgoing links of the referring site $i$. A detailed
discussion of the page rank algorithm used by Google is also found in [5] and [6].
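
Formula (8) defines the rank of a site through the ranks of the sites linking to
it, so it is usually evaluated iteratively. The following is a minimal sketch under
that assumption, using a hypothetical toy web of three sites; the function name,
the starting value of 1.0 and the fixed number of iterations are illustrative
choices, not taken from [4].

```python
def static_rank(links, d=0.85, iterations=100):
    """Iterate formula (8): R(D) = (1 - d) + d * sum(R_i / M_i) over referring sites i.

    links maps each site to the list of sites it links to.
    """
    sites = set(links) | {t for targets in links.values() for t in targets}
    rank = {site: 1.0 for site in sites}
    for _ in range(iterations):
        new_rank = {}
        for site in sites:
            # R_i / M_i for every referring site i that links to this site.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if site in targets
            )
            new_rank[site] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical toy web: a and b link to c, c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = static_rank(toy_links)
for site in sorted(ranks):
    print(site, round(ranks[site], 3))
```

In this toy web the site without incoming links keeps the minimum rank $1 - d$,
while the site referred to by two others obtains the highest rank.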

The resulting rank of a document is a function of the dynamic rank (5) and the
static rank (7). There is no unique or even optimal way of constructing this
function. A reasonable way is to choose the resulting relevancy $R_{ds}$ as a
product of the dynamic and static rank:

$$R_{ds} = R_d(q) \cdot R_s(d_i) \qquad (9)$$

Analyzing (9), a usual approach would be to use $R_s(D_i)$ instead of $R_s(d_i)$.
In practice the static rank of a document depends not only on the static rank of
the domain $D$ containing $d_i$, but also on its position in the domain (link
topology of the domain). At present this kind of search algorithm is in use in
every major Internet search engine.

The algorithms described above do indeed meet the needs of the users. This
approach is reasonable from an academic point of view and it has produced
remarkable results in the past. Today it has become more difficult to make use of
the link topology: very often the links are not set according to content
relevancy, but for other (economic) reasons. To the extent that the search engines
have become the most important information retrieval tool, they have also become a
target of spamming (site owners try to deceive the search engines by presenting
virtually more important content than there really is). An effective method of
detecting a certain type of spam is described in the appendix. By applying filter
mechanisms and modifying the parameters of the dynamic and the static relevancy
algorithms, one can "fine tune" the quality of the Internet search engines.

The two methods described above explicitly do not take into account the most
important factor: the interest of the users searching for information. The dynamic
and the static relevancy of a document are influenced by the content of the site
and by the "citation" by other sites. There is no methodical component that
reflects the voice of the searching people. This will be done by the "Vox Populi
Algorithm" (people's voice).

The main idea of the VPA is to use the information that is extractable from the
user query analysis to enhance the quality of the search. This can be done in two
different ways, by modifying either the ranking or the crawling algorithm. In this
paper the focus is not on the ranking, but on the crawling algorithm. The crawling
algorithm defines which domains and how much of their content will be included in
the search index. Sites which are not included cannot be found even by the best
ranking algorithm. At present only a small fraction (< 10%) of the Internet sites
is indexed by the search engines. The much bigger part of the Internet ("Deep
Web") is not visible in any of the search engines.

The source of information is the analysis of the queries $q$, reflecting the
users' interests and needs. The query set $Q$ may contain all single and multiple
keyword queries of the users (1). Based on these queries a multidimensional tensor
$\Omega(Q)$ can be defined, containing the information of the multiple keyword
correlations, with the dimension $N_{max}$:

$$\dim[\Omega(Q)] = N_{max} \qquad (10)$$

$N_{max}$ is the maximum length of a query; theoretically it can be infinite. In
practice the share of queries having more than 6 keywords is below 1%, while the
average query consists of about $N = 2$ keywords. In order to simplify the further
calculations one can reduce the dimension of (10) in the following way:

$$\Omega_{N_{max}}(Q) \rightarrow \Omega_{N=2}(Q) = \Omega \qquad (11)$$

In this reduction algorithm, the queries with more than two keywords are replaced
by two keyword queries, containing all possible paired combinations. For example,
a three keyword query is equivalent to 3 two keyword queries and so on.
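
The reduction to keyword pairs can be sketched as follows. The sketch assumes that
pairs are unordered and that a single-keyword query is counted on the diagonal
entry of its keyword; both conventions, as well as the sample query log, are
assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_counts(queries):
    """Reduce queries to keyword pairs, as in (11), and count them.

    Pairs are stored in sorted order, so (a, b) and (b, a) are the same entry.
    Single-keyword queries contribute to the diagonal entry (q, q); this
    convention is an assumption of the sketch.
    """
    counts = Counter()
    for query in queries:
        keywords = sorted(set(query))
        if len(keywords) == 1:
            counts[(keywords[0], keywords[0])] += 1
        for pair in combinations(keywords, 2):
            counts[pair] += 1
    return counts


# Hypothetical query log.
log = [["mp3"], ["mp3", "download"], ["free", "mp3", "download"]]
counts = pair_counts(log)
for pair, number in sorted(counts.items()):
    print(pair, number)
```

The three keyword query in the log contributes exactly three pairs, in line with
the example in the text.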



The matrix $\Omega$ is a correlation matrix of all keywords of the analyzed query
set $Q$. It is a positive and symmetric matrix ^1. One can calculate the
eigenvectors and eigenvalues of $\Omega$, transforming it into the diagonal form:

$$K^{-1} \Omega K = \Omega^{diag} \qquad (12)$$

The details of the diagonalization procedure are well known, see [8] or any other
standard textbook on mathematics. It is now important to understand the practical
meaning of the matrices $K$ and $\Omega^{diag}$. The matrix $K$ consists of
eigenvectors which are keyword combinations:



$$K = \left( e_1, e_2, \ldots \right) \qquad (13)$$

where each eigenvector has the coordinates

$$e_i = (c_1 q_1, c_2 q_2, \ldots)^T \qquad (14)$$

Similar to the definition (1), the $q_i$ are the keywords and the coefficients
$c_i$ are positive numbers, giving each keyword some "weight" compared to the other
ones (how frequently do the users ask for this keyword?). The coefficients
determine the relative importance of a keyword within an eigenvector. A typical
eigenvector (or better "eigenquery") has the form (based on the data [7], Aug.
2003):

$$e_i = (\text{"mp3"},\ 0.73 \cdot \text{"downloads"},\ 0.43 \cdot \text{"free"},\ \ldots) \qquad (15)$$

This query shows how the average user asks when searching for mp3 downloads at no
cost. The reduced ($N = 3$) keyword matrix of the example above has the form [7]:

            mp3      download   free
mp3         37.2%     8.8%       2.7%
download     8.8%    19.2%       3.6%
free         2.7%     3.6%      13.4%

The difference between the typical keyword search at present and our approach is 
that the words here have different weights, determining their relative importance 
for the users. 
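
As an illustration, the dominant eigenvector of the reduced matrix above can be
approximated with a simple power iteration. This is a sketch of one standard
numerical technique, not necessarily the diagonalization procedure used for [7],
and the resulting weights come from the 3 x 3 excerpt only, so they need not
coincide with the coefficients in (15).

```python
# Reduced keyword matrix from the example above, in percent.
OMEGA = [
    [37.2, 8.8, 2.7],
    [8.8, 19.2, 3.6],
    [2.7, 3.6, 13.4],
]
KEYWORDS = ["mp3", "download", "free"]


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: returns (eigenvalue, eigenvector), with the largest
    component of the eigenvector scaled to 1."""
    vector = [1.0] * len(matrix)
    eigenvalue = 0.0
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        eigenvalue = max(product, key=abs)
        vector = [x / eigenvalue for x in product]
    return eigenvalue, vector


eigenvalue, eigenquery = dominant_eigenpair(OMEGA)
for keyword, weight in zip(KEYWORDS, eigenquery):
    print(f"{weight:.2f} * {keyword!r}")
```

The eigenvector is normalized so that its largest coefficient equals 1, which
matches the way the eigenquery (15) is written.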

Further important information about the significance of keyword combinations is
contained in the matrix $\Omega^{diag}$. Each eigenvalue $\lambda_i$ corresponds to
an eigenvector in (14). The eigenvalue can be interpreted as the importance of the
corresponding eigenvector: it defines the importance of an eigenquery for the
users.

^1 The analysis of the order of the keywords shows a statistical asymmetry,
$N(1,2) \neq N(2,1)$. Users interested in the explicit order of the keywords can
use the option called "Exact Phrase", which is available on any modern search
engine. Therefore it is reasonable to assume that the order of the keywords is not
important for the users when they make simple queries (more than 90% of all
queries are of this type). We will use here the approximation $(1,2) = (2,1)$.



Finally, we have developed the tools for defining how a search engine can use the
information of the users to determine which content should be enhanced or reduced
in the index. Based on the described algorithm it is possible to define which
content is the "most wanted" content and which sites deliver this type of content:

$(c_1 q_1 + c_2 q_2 + \ldots) \rightarrow$ search engine $\rightarrow$ list of ranked domains

When crawling the Internet, the search engine gives each domain certain resources,
such as CPU time and memory in the index (alternatively also the number of crawled
documents or other parameters, depending on the settings of the search engine).

The practical realization of the VPA as an extension of an existing Internet search 
could be performed using the following procedure: 

1. Generate a ranking of domains by addressing the eigenqueries (14) to the existing
(old) search index. The priority of those domains is defined by the size of the
eigenvalues (17).

2. Modify the existing resource ranking list with respect to these eigenvalues.

3. Use the newly determined ranking of the domains for crawling the Internet
according to the modified resource distribution.

4. Repeat the cycle. 
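
One pass through steps 1 to 3 can be sketched as follows, assuming that the scores
of the domains for each eigenquery have already been obtained from the old index.
The boost factor `1 + alpha * demand`, the renormalization to a constant total
budget, and all names and numbers below are hypothetical choices for illustration;
they are not the correction function of the paper.

```python
def reallocate(resources, domain_scores, eigenvalues, alpha=1.0):
    """One pass of steps 1-3 for precomputed inputs.

    resources:      {domain: current crawl budget}
    domain_scores:  {eigenquery: {domain: score in the old index}}
    eigenvalues:    {eigenquery: eigenvalue (importance for the users)}
    alpha:          hypothetical tuning parameter; alpha = 0 leaves budgets unchanged
    """
    total_budget = sum(resources.values())
    total_weight = sum(eigenvalues.values())
    boosted = {}
    for domain, budget in resources.items():
        # Importance-weighted average score of the domain over all eigenqueries.
        demand = sum(
            eigenvalues[q] * scores.get(domain, 0.0)
            for q, scores in domain_scores.items()
        ) / total_weight
        boosted[domain] = budget * (1.0 + alpha * demand)
    # Rescale so that the total budget of the crawler stays constant (assumption).
    scale = total_budget / sum(boosted.values())
    return {domain: value * scale for domain, value in boosted.items()}


# Hypothetical inputs: two eigenqueries, three domains with equal budgets.
resources = {"a.example": 100.0, "b.example": 100.0, "c.example": 100.0}
domain_scores = {
    "mp3 download": {"a.example": 0.9, "b.example": 0.1},
    "free download": {"b.example": 0.5},
}
eigenvalues = {"mp3 download": 41.0, "free download": 16.0}

new_resources = reallocate(resources, domain_scores, eigenvalues)
for domain, budget in new_resources.items():
    print(domain, round(budget, 1))
```

Step 4 then corresponds to calling `reallocate` again after the next crawl, with
scores taken from the updated index.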

In order to determine which sites best fit the eigenqueries, it is useful to
calculate a dynamic rank for a whole domain, not just for a single document. A
simple method would be to sum the scores of all documents $k$ in one domain $D$:

$$R_D(e_i) \propto \sum_{k \in D} R_d^{(k)}(e_i) \qquad (18)$$

Let us assume that the amount of resources (CPU time, number of documents, data
volume etc.) given to each domain when crawling it can be expressed by a function
$M$, with

$$M = M(D_k, R_s, \ldots) \qquad (19)$$

In order to apply the VPA one can modify (19) in the following way:

$$M \rightarrow \tilde{M} = M \cdot R_{VPA} \qquad (20)$$

The function $R_{VPA}$ defines the VPA correction with regard to the old crawling
algorithm. The function $R_{VPA}$ can be presented in different ways. The basic
requirement for the function is that it is monotone in the parameters $\lambda_i$,
which define quantitatively how relevant a query is for the users. Following
Occam's principle of simplicity (Pluralitas non est ponenda sine necessitate -
entities should not be multiplied unnecessarily), this function should use only a
minimum set of free parameters, which allow the adaptation (or "fine tuning") of
the algorithm to the local requirements:



$$R_{VPA} = R_{VPA}(\lambda_i; \alpha, \beta) \qquad (21)$$

The parameters $\alpha$ and $\beta$ can be chosen freely. In the limit, the new
algorithm reproduces the existing results, i.e. the modified function in (20)
reduces to the original one:

$$\lim \tilde{M} = M \qquad (22)$$



In this paper we have shown how the analysis of queries can be used to enhance the
relevant and "most wanted" content in a search index. In this way the relevancy
experienced by the users of the search should grow: the users will find more of
what they are interested in. The existing system of the relevancy ranking of
documents or domains can remain unchanged. The algorithm will not replace existing
crawling and ranking algorithms, but the VPA will extend them by a qualitatively
new component.



Appendix



The static rank algorithm has also become the target of spamming (for exam- 
ple, "Google bombing" [11]). This means that webmasters are creating clusters 
of domains, which consist of very similar sites, referring to a single domain or a 
document. This kind of spam cluster can consist of many domains, which do not 
contain any valuable content at all. Because of this the static rank consequently is 
becoming more and more a measure of the marketing budget or the cleverness of 
the webmaster of a domain, rather than a measure of "real" reputation or content 
quality. As a result of this development, the importance of the static rank as a tool 
for determining the quality or the relevancy of a site is decreasing. 

We want to propose an algorithm which identifies this kind of spamming. The basic
idea of the static rank is reasonable: the more important sites refer (link) to a
site, the more important the site is. There is a way to discriminate between
"naturally grown" link clusters and "artificial" ones (spam).

In order to find a quantitative method which can discriminate between these two
types of link clusters, one can introduce the function $\Phi$ which describes the
statistical distribution of the relevancy $R_i$ of the links pointing to the
document $d_j$:

$$\Phi(R_i) = e^{-\frac{(R_i - R_0)^2}{2\sigma^2}} \qquad (23)$$

Here $R_0$ is the average static rank of all sites linking to the center of this
cluster, $d_j$. The parameter $\sigma$ defines the width of the distribution.



The above mentioned types of clusters can be discriminated using the distribution
$\Phi$. Naturally grown clusters contain links from an inhomogeneous set of sites.
For example, the links to the site of a well known university will come from very
small (amateur) sites of students, employees and alumni (with a low page rank), via
semi-professional institutional sites (spin-offs, research partners, ...) up to
sites of other highly ranked universities or institutes. An artificial link cluster
consists of automatically generated sites, each of them usually optimized for
different keywords, but having approximately the same static rank. As a result it
is possible to introduce a "cut-off" criterion based on formula (23). A cluster is
most likely spam if the condition

$$\sigma_{spam} < \sigma_{critical} \qquad (24)$$

is fulfilled. Here $\sigma_{critical}$ is an empirical parameter, which can be
determined from the analysis of known natural and artificial clusters (or from the
software generating the sites of the spam cluster). Estimates have shown that one
can expect a result like $\sigma_{natural} \gg \sigma_{artificial}$. A short test
example can demonstrate this: the distribution of the page ranks of sites linking
to the homepage of Stephen Hawking [10], analyzed based on formula (23), has a
width of $\sigma = 1.1$, while the sites belonging to a typical spam cluster have a
page rank distribution with $\sigma = 0.5 \ldots 0.7$. The parameter $\sigma$ can
be used for separating these two types of link clusters. The data of this example
are based on the page rank indicator of Google's toolbar [12].
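
The cut-off criterion (24) can be sketched in a few lines. The sketch assumes that
the width $\sigma$ is estimated by the standard deviation of the page ranks of the
linking sites, which is a simplification of fitting the distribution (23); the
sample rank lists and the threshold value are hypothetical.

```python
from statistics import pstdev


def cluster_width(link_ranks):
    """Width of the rank distribution of the linking sites (population std. dev.)."""
    return pstdev(link_ranks)


def looks_like_spam(link_ranks, sigma_critical):
    """Cut-off criterion (24): a narrow rank distribution suggests an artificial cluster."""
    return cluster_width(link_ranks) < sigma_critical


# Hypothetical page ranks (scale 0-10) of the sites linking to two targets.
natural = [1, 2, 2, 3, 4, 5, 6, 7, 8]
artificial = [3, 3, 3, 4, 3, 3, 4, 3, 3]
SIGMA_CRITICAL = 0.9  # assumed threshold, chosen only for this illustration

print(looks_like_spam(natural, SIGMA_CRITICAL))
print(looks_like_spam(artificial, SIGMA_CRITICAL))
```

In practice $\sigma_{critical}$ would have to be calibrated on known natural and
artificial clusters, as described above.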
